Early screening of colorectal cancer using feature engineering with artificial intelligence-enhanced analysis of nanoscale chromatin modifications

Colonoscopy is accurate but inefficient for colorectal cancer (CRC) prevention due to the low (~7–8%) prevalence of target lesions, advanced adenomas. We leveraged rectal mucosa to identify patients who harbor CRC field carcinogenesis by evaluating chromatin 3D architecture. Supranucleosomal disordered chromatin chains (~5–20 nm, ~1 kbp) fold into chromatin packing domains (~100–200 nm, ~100–1,000 kbp). In turn, the fractal-like conformation of DNA within chromatin domains and the folding of the genome into packing domains have been shown to influence multiple facets of gene transcription, including the transcriptional plasticity of cancer cells. We deployed an optical spectroscopic nanosensing technique, chromatin-sensitive partial wave spectroscopic microscopy (csPWS), to evaluate the packing density scaling D of the chromatin chain conformation within packing domains from rectal mucosa in 256 patients with varying degrees of progression to colorectal cancer. We found that the average packing scaling D of chromatin domains was elevated in tumor cells, histologically normal-appearing cells 4 cm proximal to the tumor, and histologically normal-appearing rectal mucosa compared to cells from control patients (p < 0.001). Nuclear D had a robust correlation with the model of 5-year risk of CRC with $r^2$ = 0.94. Furthermore, rectal D was evaluated as a screening biomarker for patients with advanced adenomas, presenting an AUC of 0.85 and 85% sensitivity and specificity. Artificial Intelligence (AI)-enhanced csPWS improved diagnostic performance with AUC = 0.90. Considering the low sensitivity of existing CRC tests, including liquid biopsies, to early-stage cancers, our work highlights the potential of chromatin biomarkers of field carcinogenesis in detecting early, significant precancerous colon lesions.


Introduction
Colorectal cancer (CRC) is the third-most diagnosed cancer in males and second in females, with over 52,000 annual US fatalities [1]. Improvements in the detection of CRC at earlier stages and more effective primary and adjuvant treatment options have resulted in decreased mortality rates due to CRC over the past 30 years in the United States and other Western countries [2,3]. Colonoscopy is the current gold standard screening modality, but attempting to perform colonoscopy on the entire average-risk population is inefficient, as only 7-8% have advanced adenomas. The direct visualization of adenomatous polyps within the field of view of the endoscope offers excellent sensitivity to treatable, early-stage precancerous lesions and provides the opportunity to remove advanced adenomas (stage AA, size > 1 cm or > 25% villous features or high-grade dysplasia) that may later progress into invasive CRC. However, colonoscopy is hampered by patient noncompliance, the inconvenience of bowel preparation, the potential requirement for dietary and medical adjustments, the potential for sedation-related complications, and procedural risks of perforation, major bleeding, and infection [4,5]. Current efforts to reduce CRC incidence and mortality, particularly for younger adults, are focused on identifying patients who warrant earlier screening through increased public awareness of cancer risk and symptoms and the development of early risk stratification tools with high sensitivity and accessibility [3,6,7].
Among the different types of screening techniques are stool-based and blood-based tests. Stool-based testing includes the fecal immunochemical test (FIT) and guaiac-based fecal occult blood test (gFOBT), which detect either blood or hemoglobin, and the multitarget stool DNA test (sDNA-FIT, Cologuard), which is a molecular assay to test for tumor DNA mutations and methylation markers [8][9][10][11][12]. Stool-based testing has the advantage of noninvasiveness and better patient uptake [13]. Fecal tests have also been shown to decrease CRC incidence, albeit modestly [10]. The sensitivity of FIT for AA is 21-25% [10]. The Cologuard test combines FIT with KRAS mutation and 2 methylation markers, with a sensitivity of 42% for stage AA, but this is counterbalanced by lower specificity (and hence more false positives) and higher cost (~10 times the cost of FIT alone) [14]. Recently, there has been significant interest in liquid biopsy tests, which are capable of detecting genetic and epigenetic modifications and fragmentation in circulating tumor DNA (ctDNA) [15,16]. Companies including Grail, Freenome, Guardant, Delfi, and Thrive have actively developed liquid biopsy tests as a potential cancer screening modality [17][18][19][20][21][22][23][24][25]. Their initial results demonstrated the capability to detect various cancers, including CRC; however, their sensitivity to early-stage disease dropped precipitously below a clinically acceptable level. The main limitation of such tests is the limited amount of DNA released by a tumor into circulation, with smaller lesions secreting less tumor ctDNA (~1 ctDNA/10 mL of blood) [26][27][28]. For example, a recent study revealed that ctDNA was detected in 45% of CRC cases, whereas its presence was observed in less than 2.6% of advanced adenoma cases [29]. The considerable heterogeneity in tumor cells complicates the evaluation of DNA fragmentation or specific genetic/epigenetic changes in clinically accepted blood samples using liquid biopsy tests for detecting small lesions. Guardant's recent ECLIPSE trial showed a drop in performance from an overall sensitivity of 83% for CRC to 13% for advanced adenoma [24]. The Shield blood test, which utilizes genetic, epigenetic, and proteomic markers from circulating tumor DNA, demonstrated a sensitivity of 91% in CRC and 20% in advanced adenoma with a specificity of 92%. Similarly low performance for screening advanced adenomas was observed in Freenome's recently published AI-EMERGE study (n = 664), with an overall sensitivity of 41% and specificity of 90%; sensitivity decreased to 25% when the size of the advanced adenoma was limited to less than 10 mm [30]. A sensitive, accurate, accessible, and cost-efficient test that is not restricted by lesion size may therefore provide significant clinical value. A successful test design requires three crucial elements: an accessible biomarker source, a biomarker that is sensitive to advanced adenoma, and a modality that enables population-wide screening.
Here we explore field carcinogenesis as an alternative biomarker source. Carcinogenesis involves the complex interplay between environmental exposures and genetic/epigenetic status. Field carcinogenesis is the process by which cells throughout the colonic mucosa accumulate carcinogenic alterations, and due to stochastic events, some of these give rise to a tumor clone. As cells throughout the colonic mucosa harbor these carcinogenic alterations, field carcinogenesis can be utilized as a robust marker to assess the risk of neoplasia for the entire colon [31,32]. Field carcinogenesis is the underpinning of the clinical practice of surveillance colonoscopy, that is, performing more frequent colonoscopy in patients with a prior adenoma since they are at higher risk of developing new polyps throughout the colon. Flexible sigmoidoscopy allows cancer screening from a more accessible site, and identification of adenomas in the distal colon is associated with a 2.5-fold higher risk of proximal neoplasia [2]. Several studies have shown the efficacy of flexible sigmoidoscopy as a risk stratification tool in cancer prevention and reduced mortality through utilization of field carcinogenesis [33,34]. Aside from these morphological markers, in the visually normal colonic and rectal mucosa there are myriad cellular, physiological, genomic/proteomic, epigenetic, and molecular events that correlate with concurrent and future neoplasia [35,36]. Cellular markers of neoplasia include increased proliferation and decreased apoptosis.
Physiologically, there is evidence of an early increase in blood supply potentially driven by metabolic changes (Warburg effect). There are multiple genes and proteins altered in the normal colonic mucosa. From an epigenetic perspective, both microRNA and methylation have been shown to be altered [36,37].
The occurrence of multiple synchronous and metachronous primary neoplasms, as well as local recurrence, can be well explained by field carcinogenesis [35,37]. Several studies have examined specific epigenetic alterations in CRC progression, such as hypermethylation of CpG islands by Tahara et al. and hypomethylation in LINE-1 by Kamiyama et al. Along with studies that directly examined gene and epigenetic alterations, other studies demonstrated that chromatin structural changes may also affect silencing of tumor suppressor genes [38]. The dynamic chromatin structure, which modulates gene expression by controlling the accessibility of transcription factors (TFs) and RNA polymerases (RNAPs), also holds potential to be utilized as a predictive tool for detection of early-stage cancer.
We explored 3D chromatin structure as a biomarker of colorectal carcinogenesis. Chromatin adopts a complex structure across multiple length scales. At the smallest scale, DNA wraps around histones to form nucleosome complexes colloquially known as "beads on a string." Nucleosomes and linker DNA then organize into disordered chains with diameters spanning from 5 to 24 nm that typically comprise 200-1,000 bp. The chromatin chain is packed at varying volume concentrations to form packing domains (PDs) with an average genomic size of approximately 200 kbp and average physical radius of around 80 nm [39][40][41][42]. Within PDs, chromatin follows a scaling relationship between the number of chain monomers ($N_f$) and the space it occupies that is well approximated as a power law ($N_f \propto r^D$), thus exhibiting a mass fractal-like polymer conformation behavior. Accordingly, conformation of chromatin inside a packing domain can be characterized by the chromatin packing density scaling exponent D, which provides insight into the physical nanoarchitecture of chromatin. PDs play a crucial role in transcriptional regulation. Gene transcription tends to occur at the periphery of PDs, and PD structure as well as genomic processes that regulate the emergence, maintenance, and dissipation of PDs have direct implications for the rates of transcriptional reactions and new transcriptional up- or downregulation [40]. The dysregulation of chromatin PDs has been implicated in transcriptional alterations during carcinogenesis. For example, a higher value D of a domain is associated with lower gene connectivity scaling [43,44] and more frequent long-distance gene loci contacts [43,45]. Presence of high-D PDs and greater packing domain upregulation have been causally linked with several transcriptional patterns prevalent in cancer cells, including transcriptional divergence (further upregulation of initially upregulated genes with simultaneous suppression of downregulated genes) [43], transcriptional malleability (enhanced rates of new transcriptional upregulation), and transcriptional intercellular heterogeneity (the standard deviation of expression of genes across a cell population). Taken together, these processes enhance the ability of cancer cells to attain new transcriptional states [42]. Neoplastic cells may derive advantages from transcriptional plasticity as they must adapt and acquire new traits in response to different constraints and changes in the microenvironment and host responses [40,43]. Consequently, chromatin 3D architecture can serve as a marker for the progression of neoplastic changes.
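The power-law definition of D can be illustrated with a minimal sketch: if the monomer count within radius r follows $N_f \propto r^D$, then D is the slope of log $N_f$ against log r. The function name, the radii, and the synthetic counts below are hypothetical and chosen only for illustration; this is not how csPWS estimates D, which relies on spectral measurements rather than direct monomer counting.

```python
import math

def fit_packing_scaling(radii_nm, monomer_counts):
    """Least-squares slope of log(N_f) versus log(r), i.e. D in N_f ∝ r^D."""
    xs = [math.log(r) for r in radii_nm]
    ys = [math.log(n) for n in monomer_counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic domain with a known exponent (illustrative values, not study data).
true_d = 2.6
radii = [10.0, 20.0, 40.0, 80.0]
counts = [3.0 * r ** true_d for r in radii]
estimated_d = fit_packing_scaling(radii, counts)
```

Because the synthetic counts follow an exact power law, the fitted slope recovers the exponent used to generate them; real data would scatter around the line.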
Changes in chromatin domain structure occur at various length scales, ranging from approximately 20 nm to 300 nm [46]. Conventional optical microscopy lacks the ability to differentiate structures smaller than half the wavelength of visible light, which typically ranges from 400 to 750 nm. To overcome this limitation, we have developed an optical spectroscopic statistical nanosensing approach known as csPWS, or chromatin-sensitive partial wave spectroscopic microscopy. csPWS enables calculation of the packing scaling behavior of chromatin PDs within the nucleus, thereby enabling sensitivity to structural changes that are smaller than half the wavelength of visible light at a length scale sensitivity of 23-334 nm [40]. This is accomplished by analyzing the spatial variations in the refractive index (RI) through spectroscopic analysis of the interference of scattered light within each diffractional resolution voxel [47,48]. For a given cell, the output of csPWS microscopy is an image of a nucleus where each pixel represents the packing scaling behavior of chromatin PDs. This image highlights the structural heterogeneity within a coherence volume centered around each pixel. The packing scaling D is estimated by measuring the standard deviation of the spectra generated by the interference of light scattered by the spatial variations of the chromatin density and a reference wave and applying the framework provided in [49]. Our optical statistical nanosensing approach enables a high-throughput, robust, and reproducible characterization of chromatin organization and provides valuable insights into its structural properties at the nanoscale.
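The first step of that estimate, the per-pixel standard deviation of the interference spectrum, can be sketched as follows. This is a simplified illustration under stated assumptions: the spectra are taken to be already normalized, the population standard deviation is used, and the data layout (`cube[row][col]` holding intensities over wavelength) and function names are hypothetical. The conversion from this spectral standard deviation to D follows the framework in [49] and is not reproduced here.

```python
import statistics

def spectral_sigma(spectrum):
    """Population standard deviation of one pixel's interference spectrum."""
    return statistics.pstdev(spectrum)

def sigma_map(cube):
    """cube[row][col] is a list of intensities over wavelength.

    Returns a 2D list with one spectral standard deviation per pixel.
    """
    return [[spectral_sigma(spectrum) for spectrum in row] for row in cube]

# Toy 1x2 image: a flat spectrum and an oscillating one (hypothetical values).
cube = [[[1.0, 1.0, 1.0, 1.0], [0.9, 1.1, 0.9, 1.1]]]
sigma = sigma_map(cube)
```

A pixel with a flat spectrum yields zero, while a pixel whose spectrum oscillates yields a larger value, reflecting stronger refractive index fluctuations within its coherence volume.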
Prior studies have shown that although intra-domain scaling D is a powerful regulator of transcriptional plasticity, other properties of chromatin 3D structure may play a substantial regulatory or modulating role.
Factors including nuclear crowding density, genomic size ($N_d$) of a domain, domain volume fraction as a function of intranuclear (e.g., peripheral vs interior) location, interdomain interactions, histone modification in and outside of domains, and others may affect chromatin connectivity, accessibility, transcriptional malleability and heterogeneity, and ultimately global patterns of gene expression [40,42]. These factors influence the chromatin structure and its functional properties within the nucleus. The average nuclear packing scaling D does not fully capture the complexity of dynamic chromatin structural changes. Thus, advanced machine learning and artificial intelligence (AI) deployed on csPWS images of cell nuclei can be utilized to more accurately capture the complexity of these chromatin properties.
In this study, we bridged field carcinogenesis as a biomarker source and chromatin domain dysregulation as the biomarker with recently developed csPWS microscopy to develop and test a new approach to early CRC screening, where cells are obtained by brushing the rectal mucosa, followed by csPWS measurement of their chromatin structure, with the resulting data being further analyzed with the help of machine learning. We evaluated chromatin structural alterations within and across PDs within cell nuclei of rectal cells, optimized cell acquisition and analysis, identified and optimized chromatin biomarkers of field carcinogenesis, and tested the diagnostic accuracy of this approach for the identification of patients who harbor pre-cancerous advanced adenomas in the colorectal mucosa. The overarching goal of this pilot study was to develop a screening method for the early detection of CRC and advanced adenoma.

Patient Recruitment and Demographics
The study was conducted following a double-blinded design with recruitment at NorthShore University Health System, University of Chicago, and Indiana University. Of the 135 patients in our control group, 13 patients had hyperplastic polyps and 122 patients had other non-significant findings, and our case group consisted of 13 patients with diminutive adenoma (DA), 15 patients with nondiminutive adenoma (NDA), 74 patients with advanced adenoma (AA), 9 patients with hereditary non-polyposis CRC (HNPCC), and 10 patients with CRC. Patient demographic information collected included age, gender, and smoking and drinking history. To evaluate potential confounding factors, we performed analysis of covariance (ANCOVA) on both control and case groups (defined as NDA, AA, Cancer) with the results shown in Fig. 1. The percentage of females was comparable between control (49%) and case (48%) groups. The proportion of smokers was slightly higher in the cancer population, whereas the percentage of drinkers was slightly higher in the control population. ANCOVA did not show any significant relationship between gender, smoking history, or drinking history and chromatin packing scaling D. Age was significantly higher in the case group, with a mean of 62 years compared to a mean of 57 years in the control population, and showed a small negative correlation (linear regression coefficient = -0.008) with D using the linear regression model. This suggests a minimal influence of age on rectal D, as a 10-year difference in age contributes to less than 7.2% of the variation in average D between the control and case populations, and, importantly, despite being on average slightly older, the cases had an elevated D compared to controls.
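To make the direction of the age effect concrete, the following minimal sketch fits a simple least-squares slope of D on age and multiplies it by the reported difference in mean age (62 vs 57 years). It is a plain linear regression, not the ANCOVA used in the study, and the ages and D values are synthetic points constructed to lie on a line with the reported coefficient of -0.008; the function name is hypothetical.

```python
def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic points on a line with slope -0.008 per year (not study data).
ages = [50, 55, 60, 65, 70]
d_values = [2.60 - 0.008 * (age - 50) for age in ages]
slope = ols_slope(ages, d_values)

# Shift in D implied by the reported mean ages: 62 (cases) vs 57 (controls).
expected_shift = slope * (62 - 57)
```

With a negative slope, the older case group would be expected to show a slightly lower D from age alone, which is opposite to the elevation in D that was observed in cases.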

csPWS is sensitive to chromatin domain alterations associated with field carcinogenesis
We investigated the influence of field carcinogenesis on chromatin structure by analyzing the packing scaling behavior of PDs of colonocytes brushed from different locations within the colorectal tract. Our study focused on comparing samples obtained from the tumor site, normal-appearing colonocytes brushed at locations 4 cm away from the tumor, and rectal colonocytes from patients with tumors. We observed a significant increase in D within nuclear chromatin domains in samples obtained from the tumor site, locations 4 cm away from the tumor, and the rectum compared to rectal colonocytes obtained from healthy controls (shown in Fig. 2a). However, no statistically significant differences were observed in D among the three tumor-associated locations (tumor, 4 cm away, and rectum). This suggests that our biomarker derived from rectal mucosa carries a distinct signature of field carcinogenesis which is robust throughout the colorectal tract.
We assessed the effectiveness of rectal D as a potential biomarker for field carcinogenesis. In a separate dataset (135 controls and 74 cases), we observed that both left-sided and right-sided adenomas displayed a statistically significant increase in rectal D compared to the control group (Fig. 2b). This finding underscores D as a robust biomarker that is not limited by the location of an adenoma within the colon and rectum. Overall, our findings validate that chromatin structural changes measured by packing scaling D are indicative of field carcinogenesis in early-stage CRC patients regardless of the exact location of an adenoma.

Chromatin PD alterations correlate with CRC risk
Prior studies on etiological field carcinogenesis highlighted the role of a preconditioned "field" in fostering transcriptomic, genomic, and epigenetic alterations that may lead to a neoplasm in the affected region. Therefore, the entire "field of injury" may bear the molecular biomarker of carcinogenesis irrespective of proximity to a tumor. Our objective was to detect nanoscale chromatin structural changes and alterations in PDs of histologically normal-appearing rectal colonocytes that may serve as biomarkers of carcinogenesis and are detectable by csPWS. Our findings, as illustrated in Fig. 3, reveal a clear correlation between an increase in packing scaling D and colonoscopic findings. The rectal D measured from patients with abnormal colonoscopy findings (adenoma size > 5 mm, HNPCC, or cancer) was significantly increased compared to rectal D measured from patients with a normal colonoscopy result. Specifically, we observed a non-significant increase in rectal D for smaller adenomas, such as diminutive adenoma (polyp size < 5 mm, n = 13). However, a significant increase in D was noted in patients harboring nondiminutive/nonadvanced adenomas (5-9 mm polyps, n = 15) and advanced adenomas (polyp size ≥ 10 mm, high-grade dysplasia or > 25% villous features, n = 74). Moreover, rectal D was further elevated in patients with hereditary nonpolyposis colorectal cancer (HNPCC, n = 9), characterized by a lifetime risk of CRC ranging from 60-80%, as well as in patients with colorectal cancer (n = 10). Rectal D mirrored current and past colonoscopic findings and progressively increased from the low-risk CRC group to the high-risk CRC group: control < control with high-risk history < no-risk history with advanced adenoma < low-risk history with advanced adenoma < high-risk history with advanced adenoma (Fig. 4a). These results indicate that an increase in the putative biomarker has a robust correlation with the severity of precancerous lesions and CRC elsewhere in the colon.
To assess the relationship between the dysregulation of chromatin PD in field carcinogenesis and the risk of CRC, we developed a five-year CRC risk model reflecting different stages of tumorigenesis (Fig. 4a). Rectal D effectively mirrored the risk of CRC progression. A statistically significant increase in rectal D was observed in high-risk advanced adenoma (effect size = 0.83), low-risk advanced adenoma (effect size = 0.79), and high-risk control populations (effect size = 0.75) compared to low-risk and control populations without a history of CRC (Fig. 4a). Furthermore, regression analysis (Fig. 4b) revealed a strong positive correlation between packing scaling D and five-year CRC risk ($r^2$ = 0.95). These findings demonstrate a robust and significant correlation between the dysregulation of chromatin in rectal colonocytes and the risk of CRC progression. The effectiveness of leveraging average packing scaling D in the detection of dysregulation of chromatin PD that may eventually contribute to the development of CRC provides the rationale for its use as a biomarker for CRC screening.

csPWS-measured rectal D is sensitive to advanced adenomas throughout the colorectal tract
We obtained rectal brushings from the histologically normal mucosa of patients prior to colonoscopy (135 control, 74 advanced adenomas). The dataset was split 50/50 for prediction rule development and prospective testing. In the testing set, 0.85 sensitivity and 0.85 specificity with AUC = 0.85 were observed for control patients vs those with advanced adenomas located elsewhere in the colon. One crucial aspect that many early screening tests for CRC must consider is whether sensitivity is maintained for small lesions. We evaluated the proportion of advanced adenoma patients with different polyp sizes to test whether rectal D is limited by tumor load or lesion size (Fig. 5). The majority of the advanced adenoma lesions (78.4%) were under 1.5 cm, while only 5.4% were over 3 cm in size.
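The three reported metrics can be computed from per-patient scores as in the minimal sketch below. It assumes that a higher rectal D indicates a case and that a single threshold is applied; the scores, the threshold, and the function names are hypothetical and are not study data. AUC is computed through its rank interpretation: the probability that a randomly chosen case scores higher than a randomly chosen control.

```python
def auc_mann_whitney(control_scores, case_scores):
    """Probability that a random case outscores a random control (ties count 0.5)."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))

def sensitivity_specificity(control_scores, case_scores, threshold):
    """Call a patient positive when the score is at or above the threshold."""
    true_pos = sum(1 for s in case_scores if s >= threshold)
    true_neg = sum(1 for s in control_scores if s < threshold)
    return true_pos / len(case_scores), true_neg / len(control_scores)

# Hypothetical per-patient average D values for a held-out testing set.
controls = [2.40, 2.45, 2.50, 2.55, 2.62]
cases = [2.48, 2.58, 2.60, 2.66, 2.70]

auc = auc_mann_whitney(controls, cases)
sens, spec = sensitivity_specificity(controls, cases, threshold=2.57)
```

In a development/testing split such as the one described, the threshold would be fixed on the development half and then applied unchanged to the testing half.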

AI-enhanced csPWS analysis of chromatin alterations in rectal colonocytes provides improved diagnostic performance for detection of advanced adenomas
The complex link between physical chromatin organization and genetic/epigenetic alterations in early cancer development includes the association between gene expression and packing scaling D.
Transcription involves a series of chemical reactions that are modulated through the balance between reaction rate constant and molecular accessibility of transcriptional reactants (RNA polymerase, transcription factors, etc.) and are affected by the local chromatin environment within packing domains. Leveraging recent advances in AI, specifically convolutional neural networks, we utilized transfer learning paired with dimensionality reduction with an autoencoder network to better capture this complexity. The representative features were used to train a random forest classifier, and the performance of the trained model was evaluated using repeated stratified cross-validation sets (75/25 training/testing split). Optimal sensitivity and specificity values were selected based on the cut-point on the ROC curve that maximizes the number of correct classifications within each cross fold. Enhanced diagnostic performance in differentiating control and case populations was observed, with an AUC of 0.90 (+/- 0.06), sensitivity of 0.88 (+/- 0.08), and specificity of 0.85 (+/- 0.09) (Fig. 6). We also evaluated the diagnostic performance of the AI model for different endpoints (Table 1). An identical network structure was applied to different datasets with different subgroups categorized into controls and cases. These results show that our cross-validated model maintains robust diagnostic performance, as measured by AUC, across different stages of CRC progression.
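The cut-point selection described above, choosing the threshold that maximizes the number of correct classifications, can be sketched as follows. This is an illustrative stand-in, not the study's implementation: the classifier scores and labels are hypothetical, the function name is an assumption, and ties between equally good thresholds are resolved in favor of the lowest one.

```python
def best_cut_point(scores, labels):
    """Threshold (score >= threshold means case) maximizing correct classifications.

    Candidate thresholds are the observed scores; the lowest best one is kept.
    """
    best_threshold, best_correct = None, -1
    for threshold in sorted(set(scores)):
        correct = sum(
            1 for s, y in zip(scores, labels) if (s >= threshold) == bool(y)
        )
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold, best_correct

# Hypothetical classifier outputs for one cross-validation fold (1 = case).
scores = [0.10, 0.30, 0.35, 0.60, 0.40, 0.70, 0.80, 0.90]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
threshold, n_correct = best_cut_point(scores, labels)
```

Applied within each fold, this yields one sensitivity and specificity pair per fold, which can then be summarized as a mean with its spread across folds.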

Discussion
Most CRCs arise from adenomatous colon polyps that progress into advanced adenomas and then to carcinomas. Screening from 45 years of age is recommended for average-risk patients with the goal of improved disease prognosis by identifying an early stage of CRC that is more treatable, resulting in a reduced mortality rate [10]. Among the multiple tests available for early screening of CRC, each modality has its own limitations: stool-based tests have low sensitivity, CT colonography involves radiation, and endoscopy is costly and requires bowel preparation and sedation [4,5]. The novel liquid biopsy tests that are being developed by companies including Grail, Freenome, Guardant, Delfi, and Thrive show excellent results in detection of CRC but suboptimal performance in identifying advanced adenoma [17][18][19][20][21][22][23][24][25]. This suboptimal performance in detecting early-stage cancers and significant precancerous lesions can be attributed to the biological nature of the biomarker source. Circulating cell-free DNA as a biomarker poses the limitation of its small fraction within the peripheral blood, with the majority of DNA originating from hematopoietic cells [26,29]. Early-stage cancer development is likely associated with a smaller oncogenic load that will result in a smaller amount of biomarker released into the bloodstream. It is thus likely that sensitivity to early oncogenesis drops due to fewer genetic/epigenetic alterations being accumulated, resulting in less heterogeneity within the clonal expansion process that these tests utilize.
To overcome the loss of sensitivity for smaller lesions that plagues many current tests, we explored an alternate biomarker source and type. Our results suggest that field carcinogenesis is a promising biomarker source for early CRC detection. Field carcinogenesis implies that extensive epigenetic alterations preceding dysplastic changes are not limited to the tumor site alone but encompass the entire "field of injury", regardless of the tumor load [50]. Thus, by combining the biomarker source of field carcinogenesis and the biomarker of chromatin structural changes, we expect a highly sensitive and reliable modality for identifying early CRC development.
Previous studies show both experimentally and computationally that chromatin packing scaling D, the size of chromatin domains, and chromatin density all affect the local macromolecular crowding and may play a crucial role in the regulatory mechanisms expressing phenotypic plasticity [40,43]. It has been proposed that an increase in phenotypic plasticity of gene expression can be linked to carcinogenesis, with potential mechanisms involving neoplastic cells' increased chance of survival in response to external stressors by modulation of transcriptional malleability and intercellular transcriptional heterogeneity. Detection of chromatin conformation biomarkers can be achieved using csPWS, a high-throughput optical nanosensing technology that enables nanoscale detection of changes in chromatin domain conformation. Leveraging the clinical protocol that was developed for cell acquisition, storage, and shipment, a reliable measurement of physical characteristics of the chromatin structure can be performed on rectal colonocytes obtained from rectal brushings. With a length scale sensitivity of 23-334 nm, csPWS is optimized to sense PDs (average size of ~200 kbp), in which chromatin packing behavior can be characterized by packing scaling D.
Our findings demonstrate utilization of field carcinogenesis in CRC as a powerful tool for early colorectal cancer detection. We showed that rectal D measurements using csPWS are sensitive to field carcinogenesis and can be leveraged to differentiate healthy patients from those who harbor adenomatous lesions within the entire colon. Colonocytes obtained from normal-appearing rectal tissue in patients with CRC, as well as those located 4 cm away from the tumor, showed dysregulation of chromatin PDs in the form of an increase in D compared to colonocytes from control patients. Our data show that rectal D is increased in patients harboring adenomas regardless of their location in the distal or proximal colon. These results suggest that chromatin biomarkers of field carcinogenesis can be obtained from rectal colonocytes. We confirmed the relationship between rectal D and the risk of progression to CRC by developing a model of 5-year risk of progression to CRC based on colonoscopic findings, and we found a robust correlation between the dysregulation of chromatin in rectal colonocytes and the risk of progression. These results indicate that chromatin PD changes within the nucleus of rectal colonocytes mirror changes throughout the colon, demonstrating the potential of our proposed marker for early CRC screening with easy accessibility via rectal colonocyte brushings.
Our initial univariate analysis using the nuclear average of packing scaling D of rectal colonocytes as a sole biomarker showed the ability to differentiate patients harboring advanced adenomas from control subjects with AUC = 0.85. However, the average rectal D of chromatin packing domains may not fully capture the complexity of the interplay between chromatin conformation and regulation of gene expression. Domain size, chromatin volume concentration, domain volume fraction, histone marks, interdomain structure, and other properties of 3D chromatin structure have been shown to modulate the PD regulation of transcriptional plasticity. Consequently, we utilized an AI-based feature engineering approach to better capture the key information that chromatin structural changes may present [42]. Our AI-based model leverages the power of deep learning algorithms, specifically through transfer learning with a network pretrained on a large dataset from ImageNet. The transfer learning network enables the extraction of features with information that may be difficult to attain through different analytical approaches. Our network utilizes dimensionality reduction using an autoencoder to optimize the features more representative of our data. The resultant features were then passed onto our binary classification model for differentiating healthy patients from those with advanced adenoma. Our model's robustness was validated using repeated stratified 4-fold cross-validation. The diagnostic performance was evaluated with AUC, sensitivity, and specificity metrics, with excellent results of AUC = 0.90 (+/- 0.06), sensitivity = 0.88 (+/- 0.08), and specificity = 0.85 (+/- 0.09) for advanced adenoma. We should note that the sensitivity and specificity were selected based on the optimum point on the ROC curve within each cross fold. We would like to emphasize that a majority of the adenomas that were measured in our study were small in size (< 1.5 cm), adding immense clinical value in the early prediction of CRC. Implementation of our model on the advanced adenoma subgroups based on lesion size showed comparable results, with the accuracy of correctly identifying patients as harboring advanced adenoma ranging from 81% to 83%. As our model is not dependent on tumor load, early changes manifested in chromatin nanostructures under prolonged field injury may serve as a new opportunity for a sensitive early screening tool.
We have shown that the clinical protocol of rectal colonocyte acquisition and csPWS imaging, further aided by AI-based feature engineering, can provide a sensitive modality for the detection of advanced adenoma. Our study was constrained by certain limitations, however. The study recruited a limited number of patients; therefore, it cannot provide a definitive evaluation of our approach's performance. All subjects were undergoing screening or surveillance colonoscopy; however, the ratio of cases to healthy controls in our study is notably higher than the disease prevalence among the screening population. Future risk prediction modeling can be extended from the current study once our model is shown to be robust across different demographic populations with larger-scale recruitment. The possible impact of other confounding factors such as age and dietary and lifestyle habits should be further evaluated, and any effect of potential small debris or mucus on the csPWS signal may also be investigated.

Patient Recruitment
All studies performed and samples collected were approved by the Institutional Review Boards at NorthShore University Health System, the University of Chicago, and Indiana University. All methods were performed in accordance with the relevant guidelines and regulations, and written informed consent was obtained from all participants undergoing screening or surveillance colonoscopy. The exclusion criteria for recruitment were incomplete colonoscopy due to failure to visualize the cecum, coagulopathy, past medical history of pelvic radiation, or systemic chemotherapy. Patient demographic information, including age, sex, and smoking and drinking history, was gathered. The diagnosis for each subject was made by a board-accredited GI specialist and pathologist based on colonoscopy and pathology reports.

Sample collection and shipment
All sample acquisitions adhered to the following minimally invasive protocol. Colonoscopy to the cecum was performed with standard techniques using Olympus 160 or 180 series or Fujinon colonoscopes. A sterile cytology brush (Cytobrush, CooperSurgical, Inc., Trumbull, CT, USA) was passed through the endoscope after insertion into the rectum, and gentle pressure with rotation of the bristles was applied to the rectum. A single cytology brush was used for each patient, and the tip of the brush was clipped and immediately immersed in a 1.5 mL vial filled with 750 µL of 25% ethanol. The samples were packaged and shipped to Northwestern University on the same day. Temperature was maintained below 10°C with polar pack refrigerant gel (SONOCO Thermosafe, Arlington Heights, IL, USA), and packaging adhered to the guidelines provided by the Department of Transportation, with a primary and a secondary container and absorbent material.

Sample deposition and preparation
All sample deposition and preparation were performed by an investigator blinded to patient information. Within 24 hours of sample acquisition, the brush was smeared onto two microscope glass slides (Fisher Scientific, Hampton, NH, USA), which were then fixed in 95% ethanol for 30 minutes. The slides were examined under a brightfield microscope to find the deposited cells, which included epithelial cells, red blood cells, and inflammatory cells. All measurements were taken from columnar epithelial cells as identified by a standardized hematoxylin and cytostain staining protocol. Only samples with sufficient columnar epithelium free of crests, folds, cell debris, and mucus were included in the study and imaged with csPWS. Based on a power analysis in which the confidence interval (CI) on average D was restricted to less than 5% of the difference between control and case populations, the minimum number of cells collected was set to > 30 cells per patient.

csPWS Instrumentation and Imaging
The csPWS instrument was built on a commercial microscope (Nikon Instruments, Melville, NY, USA) with modifications to include a xenon lamp (Oriel Instruments, Stratford, Connecticut, USA). Spatially incoherent white light was focused onto the sample, and the back-scattered image was projected through a liquid crystal tunable filter (Cri, Woburn, MA, USA) with a spectral resolution of 7 nm and onto a CCD camera (Princeton Instruments, Trenton, NJ, USA). Monochromatic, spectrally resolved images at wavelengths within 500-700 nm (at 2 nm increments) were acquired, with the resulting data stored in an image cube (x, y, λ) and normalized by the reference wave acquired at a blank region on the slide. We used a moderately small numerical aperture (NA) of light incidence of 0.6 and a light collection NA of 0.8 to produce a uniform intensity across the sample plane. csPWS is sensitive to, but does not resolve, sub-diffraction chromatin length scales in the range of 23-334 nm. Within the nucleus, the refractive index (RI) is proportional to the local macromolecular density (ρ), which consists mainly of protein, DNA, RNA, and other macromolecules. The refraction increment is constant, contributed mainly by chromatin, and nearly independent of the chemical constituents.
The readout of PWS microscopy is an image of a cell that captures and quantifies spatial fluctuations in macromolecular density by evaluating the standard deviation (Σ) of the interference spectra between the reference wave and the scattering caused by spatial variations of the macromolecular density ρ across different wavelengths. The value of Σ is proportional to the Fourier transform of the autocorrelation function (ACF) of ρ, integrated over the Fourier transform of the coherence volume.
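As an illustration of the per-pixel readout described above, the following minimal sketch computes the standard deviation across wavelengths of a reference-normalized spectrum. It is a simplification under stated assumptions: the function name `spectral_sigma` and the synthetic spectra are hypothetical, and any spectral filtering or detrending that a real csPWS pipeline may apply before taking the standard deviation is omitted.

```python
import math


def spectral_sigma(sample_spectrum, reference_spectrum):
    """Standard deviation across wavelengths of a reference-normalized spectrum."""
    if len(sample_spectrum) != len(reference_spectrum):
        raise ValueError("spectra must share the same wavelength grid")
    # Normalize each wavelength sample by the blank-region reference.
    normalized = [s / r for s, r in zip(sample_spectrum, reference_spectrum)]
    mean = sum(normalized) / len(normalized)
    # Population standard deviation over the wavelength axis.
    variance = sum((v - mean) ** 2 for v in normalized) / len(normalized)
    return math.sqrt(variance)


# Wavelength grid from the text: 500-700 nm in 2 nm steps (101 samples).
wavelengths = list(range(500, 701, 2))

# Synthetic spectra, for illustration only (not measured data).
reference = [1000.0] * len(wavelengths)
flat_pixel = [200.0] * len(wavelengths)
rippled_pixel = [
    200.0 + 20.0 * math.sin(2 * math.pi * (w - 500) / 40.0) for w in wavelengths
]

sigma_flat = spectral_sigma(flat_pixel, reference)
sigma_rippled = spectral_sigma(rippled_pixel, reference)
```

In this toy example a spectrally flat pixel gives a value near zero, while a pixel whose spectrum oscillates with wavelength gives a larger value, which is the qualitative behavior expected of an interference-based readout.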
The coherence volume was defined by the spatial coherence in the transverse direction (458 × 458 nm²) and the depth of field in the axial direction (~ 3 µm). Consequently, the range of length scale sensitivity of the spectral interference signal and Σ depends on the illumination and collection geometry of the instrument, in particular the numerical apertures and the spectral bandwidth. We chose these instrument parameters to maximize the sensitivity of the interference signal to the length scales relevant to chromatin conformation within packing domains. As the fundamental unit of PDs is the 5-20 nm chromatin chain, the average domain diameter is 160 nm, and larger domains approach 400 nm in diameter, the instrument parameters were chosen such that the interference signal is predominantly sensitive to chromatin density variations at length scales from approximately 23 to 334 nm. For each intranuclear location (x, y), Σ(x, y) was used to calculate the chromatin packing density scaling D(x, y) using the previously reported algorithm [49]. In particular, we employed an analytical framework that integrates finite-difference time-domain simulation and experimental results to determine the packing scaling parameter D for each pixel within a 458 nm by 458 nm area based on Σ [35]. Chromatin is the strongest contributor to the csPWS signal within the nucleus, as most other mobile macromolecules are outside the length-scale sensitivity of csPWS. In this analytical framework, D was calculated by fitting the mass-density autocorrelation function (ACF) obtained from Σ measurements in PWS to the ACFs obtained from ground truth measurements of chromatin structure in lung adenocarcinoma A549 cells and differentiated BJ fibroblasts using chromatin transmission electron microscopy (ChromTEM) images [49]. In short, Σ(x, y) is proportional to the spatial ACF of the mass density distribution, B(r), convolved with a smoothing function S(r), which is characterized by the optical system setup and the source spectrum. We should note that S(r) thus depends on various factors, including the numerical aperture of the microscope; sample characteristics such as chromatin density, macromolecular crowding, chromatin volume concentration, and genomic length; and sample-glass interface characteristics such as the forward and reverse Fresnel reflection and transmission coefficients and the refractive indices of the media and nucleus. A model parameter D_b that describes the shape of B(r) can be obtained for each given Σ within each coherence volume, which enables us to calculate the packing scaling D using the following relationship.
The estimation of packing scaling D took into account the influence of the chromatin volume concentration ϕ and the genomic size N_f of packing domains. By considering these factors, the framework allowed for a more accurate determination of the packing scaling behavior within the chromatin structure.

Evaluation of average packing scaling D
We investigated the influence of field carcinogenesis on the packing scaling behavior of chromatin PDs within the nucleus of rectal mucosa. We compared a total of 201 patients in three groups: controls (n = 136), patients with right-sided adenoma (n = 27), and patients with left-sided adenoma (n = 38). Tissue samples were collected at various distances relative to the tumor: directly from the tumor, from tissue located 4 cm away from the tumor, and from the rectum. These samples were compared to tissues collected from a healthy control population. Using PWS microscopy, we quantified the average packing scaling parameter D in the nucleus for each sample group. By comparing these values across different distances from the tumor and with the control group, we aimed to assess the impact of field carcinogenesis on the chromatin PDs within the rectal mucosa.

CRC 5-year risk
In addition to our investigation of chromatin PDs, we developed a CRC risk model that estimates the cumulative 5-year risk of developing CRC for different populations based on their baseline colonoscopy and follow-up surveillance colonoscopy. The risk model is built upon published data from a consensus update provided by the US Multi-Society Task Force and a study by Pinsky et al. on surveillance. To construct the risk model, we divided the study population within our dataset into three categories based on past surveillance colonoscopy findings: no history, low-risk history, and high-risk history. By considering both baseline colonoscopy and current colonic health, we developed a cumulative 5-year risk model incorporating the following factors: the annual risk of progression from a nonsignificant finding or diminutive adenoma to advanced adenoma, the annual risk of progression from advanced adenoma to CRC, and the risk of developing metachronous CRC.
where Na is the number of patients with no history or a history of adenoma, Nc is the number of patients with a history of cancer, AAr is the cumulative risk of developing future advanced adenoma, AA→CRC is the risk of progression from AA to CRC, and CRCm is the cumulative risk of developing metachronous CRC. Following the results from the US Multi-Society Task Force that risk progression in CRC depends on both sex and age, we calculated individual annual risk progressions for different sub-categories (male vs. female, age below and above 80 years). The annual risk of progression from AA to CRC is converted into cumulative risk using the following formula.
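The formula itself is not reproduced in this text. As a minimal sketch of one standard conversion, assuming a constant annual risk that acts independently in each year, the cumulative risk over n years is 1 − (1 − annual risk)^n. The function name `cumulative_risk` and the 1% annual risk in the example are hypothetical and are not values taken from the risk model.

```python
def cumulative_risk(annual_risk, years):
    """Probability of at least one event within `years`, given a constant annual risk.

    Assumes the annual risk is the same and independent in every year.
    """
    if not 0.0 <= annual_risk <= 1.0:
        raise ValueError("annual_risk must be a probability between 0 and 1")
    if years < 0:
        raise ValueError("years must be non-negative")
    # Complement of surviving every year without the event.
    return 1.0 - (1.0 - annual_risk) ** years


# Hypothetical example: a 1% annual risk accumulated over 5 years.
example_5yr = cumulative_risk(0.01, 5)
```

Under these assumptions the 5-year cumulative risk is slightly below five times the annual risk, because a patient who has already progressed cannot progress again.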
By incorporating these key factors, our risk model provided a tool for a comprehensive evaluation of the impact of packing scaling D and chromatin structural changes during the progression and development of CRC, including early stages such as adenoma. We leverage this 5-year cumulative risk model as a reference to evaluate whether rectal D is sensitive to field carcinogenesis, reflecting not only the current level of dysplasia but also past colonoscopy results representative of field injury.

AI analysis of packing scaling D
AI was employed to assess the potential of packing scaling D as a putative biomarker for early detection of CRC and advanced adenoma. A deep learning approach was leveraged to capture the complex relationship between D, a physical descriptor of chromatin organization, and oncogenic transformation.
Our AI-driven approach consisted of four steps: nucleus segmentation, preprocessing, feature learning, and classification (shown in Fig. 7). Nucleus segmentation was conducted by a trained investigator, blinded to patient information, using custom software with a graphical user interface. During the preprocessing step, the segmented nuclear D images were resized and subjected to min-max normalization.
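The preprocessing step can be sketched as follows. This is an illustration under assumptions rather than the study's implementation: the interpolation method used for resizing is not stated in the text, so nearest-neighbor sampling is assumed here, and the function names and the small D map are hypothetical.

```python
def resize_nearest(image, out_h, out_w):
    """Resize a 2D list of pixel values by nearest-neighbor sampling (assumed method)."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def min_max_normalize(image):
    """Rescale a 2D list of pixel values linearly to the range [0, 1]."""
    flat = [v for row in image for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A constant image carries no contrast; map it to zeros.
        return [[0.0 for _ in row] for row in image]
    scale = hi - lo
    return [[(v - lo) / scale for v in row] for row in image]


# Hypothetical 2 x 2 map of D values for one segmented nucleus.
d_map = [[2.2, 2.4], [2.6, 3.0]]
resized = resize_nearest(d_map, 4, 4)
normalized = min_max_normalize(resized)
```

Normalizing after resizing keeps every input to the feature extractor on the same [0, 1] scale regardless of the absolute D range of an individual nucleus.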
For feature learning, we employed a transfer learning approach with ResNet50, a convolutional neural network (CNN) pretrained on the ImageNet database. Features were extracted from the final convolutional layer of the CNN architecture. To enhance data representation and computational efficiency, an autoencoder network was implemented. The autoencoder was trained to minimize its reconstruction loss, and the encoder output served as the representative feature set.
In the classification step, a parameter-tuned random forest classifier was trained on the training set to distinguish the healthy control population from the case population with advanced adenoma. The classifier was tuned through grid search, exploring multiple configurations and selecting the one with minimal error on our dataset. To robustly evaluate performance on a relatively small dataset, we employed repeated stratified 4-fold cross-validation with five iterations to compute diagnostic performance metrics including area under the curve (AUC), sensitivity, and specificity. Optimal sensitivity and specificity values were selected at the cut-point on the receiver operating characteristic (ROC) curve that maximizes the number of correct classifications within each cross-validation fold. By repeatedly splitting the data into four folds and iteratively evaluating the results, we obtained estimates of diagnostic performance across different subsets of the dataset. This evaluation method supports the generalizability and reliability of our findings.
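The splitting scheme described above can be sketched in a few lines. This is a minimal standard-library illustration, not the code used in the study: the function name, the random seed, and the label counts (12 controls, 8 cases) are hypothetical. With four folds and five repeats it yields 20 train/test partitions, each of which preserves the class ratio.

```python
import random


def repeated_stratified_kfold(labels, n_splits=4, n_repeats=5, seed=0):
    """Yield (train_idx, test_idx) pairs with roughly equal class ratios per fold."""
    rng = random.Random(seed)
    # Group sample indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for _ in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            # Deal the indices of this class round-robin across the folds.
            for pos, idx in enumerate(shuffled):
                folds[pos % n_splits].append(idx)
        for k in range(n_splits):
            test_idx = sorted(folds[k])
            train_idx = sorted(
                i for j, fold in enumerate(folds) if j != k for i in fold
            )
            yield train_idx, test_idx


# Hypothetical labels: 0 = control, 1 = advanced adenoma.
labels = [0] * 12 + [1] * 8
splits = list(repeated_stratified_kfold(labels))
```

In practice, a classifier would be fit on each `train_idx` and scored on the matching `test_idx`, and the 20 resulting AUC, sensitivity, and specificity values would be summarized by their mean and spread. Stratifying by class matters for a small, imbalanced cohort because an unstratified split can leave a test fold with very few cases.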

Table 1
Diagnostic performance of AI model at different endpoints.

An important question is whether AI-enhanced csPWS is robust for identifying patients harboring advanced adenomas regardless of size. Applying the AI-enhanced analysis described above to subgroups of advanced adenoma based on lesion size (< 1 cm, 1-1.5 cm, and > 1.5 cm) achieved comparable classification performance across lesion sizes. With a fixed specificity of 0.88, the sensitivity for identifying advanced adenoma ranged from 0.81 to 0.83 (Table 2). AI-enhanced csPWS thus demonstrated the ability of our proposed biomarker to detect small lesions by leveraging the characteristics of field carcinogenesis, enabling early detection of CRC and advanced adenoma.
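Reporting sensitivity at a fixed specificity amounts to choosing, among all score thresholds that meet the specificity target, the one that detects the most cases. The sketch below illustrates that selection under assumptions: the function name, the classifier scores, and the 0.8 target used in the example are hypothetical and are not data from the study.

```python
def sensitivity_at_specificity(scores, labels, target_specificity):
    """Return (threshold, sensitivity, specificity) for the most sensitive
    threshold whose specificity is at least the target.

    Samples with score >= threshold are called positive (label 1 = case).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    # Candidate thresholds: every observed score, plus one above the maximum
    # (which calls nothing positive and so always reaches specificity 1).
    candidates = sorted(set(scores)) + [max(scores) + 1.0]
    best = None
    for t in candidates:
        specificity = sum(s < t for s in negatives) / len(negatives)
        sensitivity = sum(s >= t for s in positives) / len(positives)
        if specificity >= target_specificity:
            if best is None or sensitivity > best[1]:
                best = (t, sensitivity, specificity)
    return best


# Hypothetical classifier scores, for illustration only.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.25, 0.35, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
threshold, sens, spec = sensitivity_at_specificity(scores, labels, 0.8)
```

Fixing the specificity in this way makes sensitivities comparable across subgroups, since every subgroup is evaluated at the same false-positive rate among controls.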